Abstract. We describe a substring search problem that arises in group 
presentation simplification processes. We suggest a two-level searching 
model: skip and match levels. We give two timestamp algorithms which 
skip searching parts of the text where there are no matches at all and 
prove their correctness. At the match level, we consider Harrison signature, 
Karp-Rabin fingerprint, Bloom filter and automata based matching 
algorithms and present experimental performance figures. 

1 Introduction 

A fundamental technique used in computer science is to search for a specific 
substring in a large body of text. Text-processing systems must allow their users 
to search for given character strings within a body of text. Database systems 
must be capable of searching for records with stated values in specified fields. 
Substring searching plays an important role in group presentation simplification 
processes too. Most of the execution time of these processes is used in the substring 
search part. 

We describe the substring search problem that arises in presentation sim- 
plification processes, also known as Tietze processes, and give two timestamp 
algorithms which skip searching parts of the text where there are no matches 
at all. First we give the background of Tietze processes; then we formalize the 
substring searching problem in this context and define the terminology used 
in this paper. We analyze the problem and give a two-level searching model. 
Based on this analysis, we study both the skip and match levels. We present 
two timestamp algorithms at the skip level and two minimal-cover theorems for 
those algorithms. At the match level, we consider algorithms and data struc- 
tures based on Harrison signatures, Karp-Rabin fingerprints, Bloom filters and 
automata. We indicate the practical performance of these in this context. 

Finitely presented groups have been much studied. All requisite mathematical 
background is provided in [14, Chapter 1]. An overview of algorithms for such 
groups is included in [5] , and a comprehensive book on computation with finitely 
presented groups [19] has recently appeared. 

A finitely presented group may be given by a presentation 
$G = \langle g_1, \ldots, g_d \mid R_1, \ldots, R_n \rangle$ where the $g_i$ are 
generators and the $R_j$ are relators. Generally speaking, presentations are 
good if they are short: few generators; and few relators 
of reasonable length. This makes them relatively intelligible to humans and also 
often makes them better suited for computer calculations. We are interested in 
the situation where we have what we regard as a bad presentation for a group 
and we wish to find a good presentation. This kind of situation may arise in a 
number of ways. 

A theorem of Tietze proves that, given two presentations of a group, there 
exists a sequence of simple transformations which demonstrates that the presen- 
tations are of the same group. However there is no general algorithm for finding 
such a sequence, a consequence of unsolvability results in group theory. Various 
Tietze transformation procedures, which input a "bad" presentation and output 
a "good" presentation for a group, have been described ([9, 17, 11]). Newer 
procedures, written in the higher level GAP [18] language (with special kernel 
support), have also been developed by Volkmar Felsch and Martin Schönert 
in Aachen. Three main principles used by Tietze transformation methods to 
simplify presentations are: short eliminations; long eliminations; and substring 
replacements. 

In each short elimination phase, all relators of length 1 and non-involutory 
relators of length 2 are used to eliminate generators and their associated relators. 
In each long elimination phase, redundant generators (generators which occur 
only once in some relator) and their associated relators are eliminated using 
relators with length greater than 2. 

In each substring replacement phase, relators are shortened by replacing long 
substrings with shorter equivalent strings. First substring searching is performed. 
A relator $R_i$ is chosen and other relators are searched for a matching substring $v$ 
in a rotation $uv$ of $R_i$ or its inverse and in a rotation $wv$ of $R_j$, with the length of 
$v$ greater than the length of $u$. Then, when such a match is found, the relator $R_j$ 
is replaced by the shorter relator $wu^{-1}$. One substring replacement pass involves 
the application of this process with $R_i$ running once through all relators in the 
presentation. 
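
As a small illustration of a single replacement step, the following Python sketch encodes generators as nonzero integers, with -g standing for the inverse of g. This encoding and the function names are assumptions made for the example and are not taken from any of the cited implementations. The step is valid because uv is a relator, so v equals the inverse of u in the group.

```python
def formal_inverse(word):
    """Reverse the word and invert each symbol (g <-> -g)."""
    return tuple(-g for g in reversed(word))

def replace_substring(w, v, u):
    """Given a rotation w v of R_j and a rotation u v of R_i (or of its
    inverse) with len(v) > len(u), return the shorter relator w u^-1."""
    if len(v) <= len(u):
        raise ValueError("match is not useful: v must be longer than u")
    return tuple(w) + formal_inverse(u)

# R_i = a b c, read as u v with u = (a) and v = (b, c);
# R_j = d b c e, rotated to w v with w = (e, d).
a, b, c, d, e = 1, 2, 3, 4, 5
new_rj = replace_substring(w=(e, d), v=(b, c), u=(a,))
print(new_rj)  # (5, 4, -1), that is, e d a^-1
```

Here the relator of length 4 is replaced by one of length 3.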

Tietze processes function by working through these steps in some sensible 
order, guided by heuristics. Short eliminations and substring replacements reduce 
the total length of the presentation. Long eliminations can, and generally do, 
increase the length, often quite significantly. The substring searching component 
of substring replacements is by far the most time consuming part of Tietze 
processes, which is why we focus on it here. 

2 Definition of the problem 

For a presentation $G = \langle g_1, \ldots, g_d \mid R_1, \ldots, R_q \rangle$, we let $l_i$ denote the length of 
$R_i$ and $\Sigma$ denote the set of generators and their inverses $\{g_1, g_1^{-1}, \ldots, g_d, g_d^{-1}\}$ 
(the alphabet). Rel denotes the sequence of relators $(R_1, R_2, \ldots, R_{q-1}, R_q)$, which 
is often kept sorted so that $l_1 \le l_2 \le \cdots \le l_{q-1} \le l_q$. We define the substring 
searching (and replacement) problem that arises in Tietze transformation 
processes as: 

Given Rel over $\Sigma$, for any two relators $R_p$ and $R_t \in$ Rel with $l_p \le l_t$, determine 
whether a common substring of length at least $\lceil (l_p + 1)/2 \rceil$ exists in equivalents of 
$R_p$ and $R_t$ and, if so, shorten $R_t$. 

We use the following terminology. A useful common substring is a common 
substring of $R_i$ and $R_j$ with length greater than half the length of the shorter 
of $R_i$ and $R_j$. A common substring search between two relators $(R_p, R_t)$ and 
their equivalents is denoted by ComStr($R_p$, $R_t$), while the more usual common 
substring search between two strings $s_1$ and $s_2$ is denoted by comsubstr($s_1$, $s_2$). 
$R^i$ denotes a string made from relator $R$ by rotating it $i$ positions right. The 
formal inverse of a string is obtained by reversing the string and inverting each 
symbol in the string (that is, replacing each $g_i$ by $g_i^{-1}$ and vice versa). The 
equivalents of a relator $R$ which we consider are its rotations and their formal inverses. 
A pass is a substring replacement phase in which each pair of relators in Rel is 
considered once and only once for a ComStr($R_p$, $R_t$). If at least one of $R_p$ and 
$R_t$ has been changed since the previous ComStr($R_p$, $R_t$), then ComStr($R_p$, $R_t$) 
is a necessary search in the current pass; otherwise it is unnecessary, since it 
is impossible that these relators have a useful common substring. We use $R_p$ to 
refer to a pattern relator and $R_t$ to a text relator. 
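
The terminology can be made concrete with a short Python sketch. It assumes relators are stored as non-empty tuples of nonzero integers, with -g denoting the inverse of generator g; this representation and the function names are illustrative choices, not part of the definitions above.

```python
def rotate_right(word, i):
    """The word rotated i positions to the right (circularly)."""
    n = len(word)
    i %= n
    return word[n - i:] + word[:n - i]

def formal_inverse(word):
    """Reverse the word and invert every symbol."""
    return tuple(-g for g in reversed(word))

def equivalents(word):
    """All rotations of the word, followed by their formal inverses."""
    rotations = [rotate_right(word, i) for i in range(len(word))]
    return rotations + [formal_inverse(r) for r in rotations]

def is_useful_length(k, lp, lt):
    """A common substring of length k is useful when k exceeds half the
    length of the shorter relator."""
    return 2 * k > min(lp, lt)

R = (1, 2, -3)
for w in equivalents(R):
    print(w)
```

A relator of length l has 2l equivalents in this sense, although some may coincide when the relator is a proper power.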

We exemplify the performance of the various methods for substring searching 
applied to group presentations by considering some specific examples in detail. 
The performance gains demonstrated here typify the improvements achieved in 
this application area by these methods. 

We study three applications, giving performance on presentations J, T and 
$\mathcal{R}$. Presentation J is of the index 100 subgroup in the Janko simple group $J_2$, 
and comes from a subgroup presentation method (see [8]). It has 201 generators, 
510 relators with longest relator of length 12, and total relator length 2,795. 
Presentation T is of the index 152 subgroup in the Fibonacci group F(2, 9) 
and was obtained the same way. (It plays a crucial role in proving F(2, 9) to 
be infinite, see [12, 15].) Presentation T has 153 generators, 304 relators with 
longest relator of length 13, and total relator length 2,119. Presentation $\mathcal{R}$ is for 
the restricted Burnside group R(2, 5), a group of order $5^{34}$. It has 34 generators 
and 595 relators with longest relator of length 41, and total relator length 3,443. 
It was derived from a nilpotent quotient algorithm (see [10]). 

3 Analysis of the problem 

Algorithms and data structures for substring searching in various situations have 
been much studied, see [1] and [6, Chapter 7] for example. However the case con- 
sidered here differs substantially from those covered there. Major distinguishing 
features of our situation are: all strings are (in effect) circular; formal inverses are 
(implicitly) present; many substrings are simultaneously sought; and the text is 
dynamic, changing very often. In this section we study features of our problem. 



In our substring searching problem, each relator $R_i$ can be thought of as 
representing $2l_i$ strings: $l_i$ strings obtained by rotation; and another $l_i$ strings 
obtained by formal inversion. Thus Rel, which consists of $q$ relators, represents 
$2\sum_{i=1}^{q} l_i$ strings. The common substring search process for a pair of relators $R_p$ 
and $R_t$, ComStr($R_p$, $R_t$), can be concisely described in pseudocode in terms of 
$2 l_p l_t$ comsubstrs (common substring searches for strings) as follows. 

for i := 0 to l_p - 1 { for j := 0 to l_t - 1 { 
    comsubstr(R_p^i, R_t^j); comsubstr((R_p^i)^{-1}, R_t^j); } } 

A pass consists of the $\binom{q}{2}$ choices of pairs of relators, that is, $q(q-1)/2$ 
ComStrs. 

Since all rotations of $R_i$ are substrings of the string $R_iR_i$, it is not necessary 
to explicitly generate them all separately. A simple solution comes from relator 
extension. If we extend $R_p$ and $R_t$ by their initial $l_p/2$ symbols to obtain $R'_p$ and 
$R'_t$, then ComStr($R_p$, $R_t$) = {comsubstr($R'_p$, $R'_t$); comsubstr($(R'_p)^{-1}$, $R'_t$);}. 
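
Relator extension can be sketched in Python as follows. This is a set-based detection routine written for illustration only: it reports whether a useful common substring exists, it does not locate or extend the match, and it is not the brute-force, fingerprint or automaton method discussed later. Relators are assumed to be non-empty tuples of nonzero integers, with -g as the inverse of g, and the pattern is assumed to be no longer than the text.

```python
def formal_inverse(word):
    """Reverse the word and invert every symbol."""
    return tuple(-g for g in reversed(word))

def circular_windows(word, k):
    """All length-k substrings of the circular word, read off the word
    extended by its first k - 1 symbols (requires k <= len(word))."""
    extended = word + word[:k - 1]
    return {extended[i:i + k] for i in range(len(word))}

def has_useful_common_substring(rp, rt):
    """True if an equivalent of rp and a rotation of rt share a substring
    of the minimal useful length ceil((lp + 1) / 2)."""
    k = (len(rp) + 2) // 2          # integer form of ceil((lp + 1) / 2)
    pattern = circular_windows(rp, k) | circular_windows(formal_inverse(rp), k)
    return not pattern.isdisjoint(circular_windows(rt, k))

print(has_useful_common_substring((1, 2, 3), (4, 2, 3, 5)))    # True
print(has_useful_common_substring((1, 2, 3), (4, -3, -2, 5)))  # True
print(has_useful_common_substring((1, 2, 3), (4, 5, 6, 7)))    # False
```

Note that k - 1 equals the integer part of $l_p/2$, so the extension used here has the length stated above.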

All of the relators which make up the presentation are used as patterns as 
well as texts. During the Tietze processing, they change frequently. Eliminations 
(short and long) and successful replacement passes make changes to relators. 
However, not all relators are changed between substring replacement passes. We 
use a two-level substring searching model, the skip level and the match level, to 
speed up the substring replacement passes. 

At the skip level, unnecessary searches are identified and skipped. Early im- 
plementations of Tietze transformation programs compare each relator Ri with 
every subsequent relator in the relator sequence in every pass. The idea here is 
that pairs of relators already searched are not searched again. Havas and Ollila 
[11] used change flags to avoid unnecessary searches. Here we improve on change 
flags by using a timestamp system. Two timestamp algorithms are given in the 
next section. Since many relators are not changed in a pass, this speeds up the 
whole process tremendously over the early methods, as the cost of timestamping 
is negligible. 

The following practical results give the total number of relator pairs searched 
in equivalent Tietze processes on the given presentation. In the case of J , a total 
of 6,693,105 searches were made with the early method. Change flags reduced 
this to 482,959 searches, further reduced to 351,253 by timestamps. (Only 2,376 
of these were successful.) For T, the corresponding figures are: 9,513,358 (early 
method); 832,689 (change flags); 585,383 (timestamps); and 2,739 (successful). 
Thus over 90% of the ComStrs which were done with the early method are 
skipped if we use change flags; timestamps provide a further 27% saving. Since 
the time used for handling change flags or timestamps is insignificant, searching 
time is similarly reduced. 

At the match level, numerous variations are possible, with plenty of scope for 
improvement. This is because the successful search rate is very low, even when 
using timestamps, as illustrated above. The successful ComStrs (a useful com- 
mon substring found) comprise only 0.68% and 0.47% of the necessary ComStrs 
for J and T respectively. Thus, if we can detect that there is no useful common 
substring quickly then a substantial time saving may be achieved. 



4 Timestamps 



In this section we present two timestamp algorithms and two theorems. One 
algorithm deals with "sorted relators" and the other with "unsorted relators". 
$R_i$.Tp records the latest time when $R_i$ is used as a pattern. $R_i$.Ts records the 
latest time when $R_i$ is changed in a ComStr($R_p$, $R_i$). 

The following algorithm is applicable when Rel is kept sorted all the time (as 
in [9, 11]). Note that Other operations refers to elimination phases, which are 
also responsible for updating R[*].Tp, R[*].Ts and NumRels, as appropriate. 

Initialize: R[*].Tp := -1; R[*].Ts := 0; timer := 1; 
while (MoreSubstringSearchPass) { 
    for p := 1 to NumRels-1 { 
        for t := p+1 to NumRels { 
            if (R[p].Tp <= R[t].Ts) { 
                ComStr(R[p], R[t]); 
                if R[t] changed { 
                    R[t].Tp := -1; R[t].Ts := timer; reorder R[t] in Rel; 
                } 
            } 
        } 
        R[p].Tp := timer; timer++; 
    } 
    Other operations; 
    Compute MoreSubstringSearchPass; 
} 

Theorem 1 This algorithm performs all necessary searches and all searches the 
algorithm does are necessary. 

The proof is by detailed but straightforward analysis. 
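
The skip test can be exercised with a simplified Python sketch. It is not a full implementation of the algorithm above: reordering of changed relators within Rel and the elimination phases ("Other operations") are omitted, and the search itself is abstracted as a hypothetical callback `com_str(pattern, text)` that returns True when it has changed the text relator.

```python
from dataclasses import dataclass

@dataclass
class Relator:
    word: tuple
    tp: int = -1   # latest time this relator was used as a pattern
    ts: int = 0    # latest time this relator was changed by a search

def timestamp_pass(rels, com_str, timer):
    """Run one pass; return the updated timer and the number of searches."""
    searches = 0
    for p in range(len(rels) - 1):
        for t in range(p + 1, len(rels)):
            if rels[p].tp <= rels[t].ts:       # otherwise the pair is skipped
                searches += 1
                if com_str(rels[p], rels[t]):  # text relator was changed
                    rels[t].tp = -1
                    rels[t].ts = timer
        rels[p].tp = timer
        timer += 1
    return timer, searches

# Four relators and a search that never changes anything: the first pass
# examines all 6 pairs and the second pass skips every pair.
rels = [Relator((g,)) for g in (1, 2, 3, 4)]
timer, first = timestamp_pass(rels, lambda p, t: False, 1)
timer, second = timestamp_pass(rels, lambda p, t: False, timer)
print(first, second)  # 6 0
```

If a search does change a text relator, only the pairs whose pattern was last used no later than that change are searched again in the next pass.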

The following algorithm is used when Rel is not necessarily kept sorted 
within a pass (as in [18]). This is applicable if $R_k$ remains the $k$th relator in 
Rel throughout a pass, no matter whether it is changed or not. However, at the 
beginning of each pass Rel is sorted. 

Initialize: R[*].Tp := R[*].Ts := *; 
while (MoreSubstringSearchPass) { 
    TsLocal[*] := 0; 
    for p := 1 to NumRels-1 { 
        for t := p+1 to NumRels { 
            if (R[t].len >= R[p].len and ((TsLocal[p] + TsLocal[t]) != 0 
                or R[p].Tp > R[t].Tp or R[p].Tp <= R[t].Ts)) 
            { ComStr(R[p], R[t]); if (R[t] changed) TsLocal[t] := p; } 
        } 
        R[p].Tp := p; R[p].Ts := TsLocal[p]; 
    } 
    Sort(Rel); 
    Other operations; 
    Compute MoreSubstringSearchPass; 
} 


Theorem 2 This algorithm performs all necessary searches and all searches the 
algorithm does are necessary. 

Again the proof is by detailed but straightforward analysis. 

5 Signatures, Fingerprints, Bloom filters, and Automata 

In this section, we study methods for the match level. We can achieve efficiencies 
if we can detect unsuccessful searches early. There are two categories of string- 
matching algorithms: exact match algorithms such as brute-force, Knuth-Morris- 
Pratt, Boyer-Moore and Boyer-Moore derivatives, and automaton-based ones; 
and algorithms that initially allow errors, such as those of Harrison and Karp- 
Rabin. All of these are described in [6]. 

In spite of the theoretical worst case inferiority of brute force searching, 
its average case performance is linear in the length of the text being searched. 
Furthermore, Gonnet and Baeza-Yates [6, Table 7.4] show that it performs quite 
well in practice. In [9] a variant of brute force searching which enables a search 
for many strings simultaneously at no extra cost was used. 

Thus, consider $R_p$ and $R_t$ with $l_p \le l_t$. In order to shorten $R_t$ any useful 
common substring must have length greater than half the length of $R_p$. This 
means that it will contain either the first symbol of $R_p$ or a middle symbol, 
or the inverse of one of those. (Further, if $R_p$ is a nontrivial power, a useful 
substring must contain the first symbol or its inverse. Also, generators which 
are known from the presentation to be involutions are known to be their own 
inverses.) So the search starts by searching for one of at most four symbols as 
starting points in $R_t$. When such a match is found an attempt is made to extend 
the match circularly both backwards and forwards until it is long enough to be 
useful. 
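
The claim that every useful substring contains the first or a middle symbol of the pattern can be checked exhaustively for small lengths. The Python sketch below assumes that "a middle symbol" means the symbol at index l // 2 (counting from 0) of a relator of length l; that choice of index and the function names are assumptions made for this illustration.

```python
def window_hits_anchor(length, start, k):
    """Does the circular window of k positions beginning at `start`, on a
    cycle of the given length, contain position 0 or position length // 2?"""
    positions = {(start + i) % length for i in range(k)}
    return 0 in positions or (length // 2) in positions

def claim_holds(max_len):
    """Check every minimal useful window for all lengths up to max_len.
    Longer windows contain a minimal one, so they need no separate check."""
    for length in range(1, max_len + 1):
        k = (length + 2) // 2   # smallest length greater than length / 2
        if not all(window_hits_anchor(length, s, k) for s in range(length)):
            return False
    return True

print(claim_holds(200))  # True
```

The check succeeds because the two anchor positions split the cycle into gaps of at most ceil(l/2) - 1 positions, which is too short to hold a window of length greater than l/2.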

The first use of algorithms which allow errors to save time in this context 
was by Havas and Ollila [11], based on ideas of Harrison [7]. The speed up comes 
from the replacement of some time consuming substring searches by much faster 
tests which reveal that no useful match is possible. Strings are characterized 
by signatures. Fast signature generation and comparison often determines that 
one string cannot be a substring of another much more quickly than explicit 
string searching. Havas and Ollila used rotation and inversion invariant signa- 
tures well-suited to this context and present detailed performance results. This 
approach was reasonably successful, but signature computation and comparison 
is by no means free. Havas and Ollila concluded that change flags outperformed 
signatures, and this result extends to timestamps. 
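
To illustrate the idea of a rotation and inversion invariant signature, the following Python sketch uses generator occurrence counts that ignore exponent signs. This particular signature is a hypothetical example chosen for simplicity; it is not claimed to be the signature used by Havas and Ollila. Relators are assumed to be tuples of nonzero integers with -g as the inverse of g, and the pattern is assumed to be no longer than the text.

```python
from collections import Counter

def signature(word):
    """Multiset of generators occurring in the word, ignoring exponent sign.
    Rotation and formal inversion leave it unchanged."""
    return Counter(abs(g) for g in word)

def may_share_useful_substring(rp, rt):
    """Necessary, but not sufficient, condition for a useful common
    substring: the signatures must overlap in at least k symbols."""
    k = (len(rp) + 2) // 2                      # ceil((lp + 1) / 2)
    overlap = sum((signature(rp) & signature(rt)).values())
    return overlap >= k

print(may_share_useful_substring((1, 2, 3), (4, 5, 6, 7)))    # False: skip search
print(may_share_useful_substring((1, 2, 3), (4, -3, -2, 5)))  # True: must search
print(may_share_useful_substring((1, 2, 3), (2, 4, 3, 5)))    # True, yet no match exists
```

A False result rules out a match without any string comparison, while a True result still requires an explicit search, as the last call shows.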

The Tietze procedures in GAP [18] use the Karp-Rabin algorithm in the 
substring searching part, combined with change flags. In this, strings are 
characterized by shorter entities called fingerprints. Efficiencies are achieved by 
manipulating fingerprints instead of the (possibly much longer) strings. The algorithm 
associates with each string $X$ a fingerprint $\phi(X)$. The search for a match initially 
compares short fingerprints. When a fingerprint match is found, an exact-match 
method (usually) has to be invoked to confirm whether the fingerprint match 
corresponds to an actual string match or is a false match. False matches may 
occur unless $\phi(X)$ is a one-to-one mapping, which would be unusual. 
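
A rolling polynomial fingerprint over the circular windows of a relator might be sketched in Python as follows. The base, the modulus and the function names are illustrative assumptions and are not the fingerprint function used in GAP. Relators are taken to be tuples of nonzero integers with -g as the inverse of g, and the window length k is assumed not to exceed the relator length.

```python
BASE = 257
MOD = (1 << 61) - 1   # a Mersenne prime; both constants are arbitrary choices

def circular_fingerprints(word, k):
    """Fingerprints of the len(word) circular windows of length k, each
    obtained from the previous one in constant time."""
    n = len(word)
    ext = word + word[:k - 1]          # relator extension
    top = pow(BASE, k - 1, MOD)        # weight of the symbol leaving the window
    h = 0
    for g in ext[:k]:
        h = (h * BASE + g) % MOD
    prints = [h]
    for i in range(1, n):
        h = ((h - ext[i - 1] * top) * BASE + ext[i + k - 1]) % MOD
        prints.append(h)
    return prints

def fingerprint_candidates(rp, rt):
    """Start positions in rt whose window fingerprint equals that of some
    minimal useful substring of rp or of its formal inverse. Each candidate
    still has to be confirmed by an exact comparison."""
    k = (len(rp) + 2) // 2
    inv = tuple(-g for g in reversed(rp))
    pattern = set(circular_fingerprints(rp, k)) | set(circular_fingerprints(inv, k))
    return [i for i, f in enumerate(circular_fingerprints(rt, k)) if f in pattern]

print(fingerprint_candidates((1, 2, 3), (4, 2, 3, 5)))  # [1]
```

Each pattern relator contributes $2l_p$ fingerprints, so a set lookup replaces up to $2l_p$ string comparisons per text position.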

In GAP, a fingerprint (an integer) is associated with each minimal 
possibly-useful substring in each pattern relator and its equivalents. This means that $2l_p$ 
strings of length $\lceil (l_p + 1)/2 \rceil$ are characterized by $2l_p$ integers. Then fingerprints 
are computed for the $l_t$ length-$\lceil (l_p + 1)/2 \rceil$ substrings of each text relator (and its 
rotations). In order to quickly search for fingerprint matches, the pattern 
fingerprints are stored in a type of hash table. The hash table is represented by a data 
structure called a Bloom filter [4], which reduces the amount of space required 
to contain the hash-coded information from that associated with conventional 
methods. The reduction in space is at the cost of some percentage of erroneous 
look-ups, which may be tolerable in some applications. The filter comprises a bit 
vector and several hash transformations. 

The Bloom filter in GAP is organized so that fingerprints are represented by 
3 bits, one bit in each of three bit-tables. Three hash functions compute table 
addresses for each fingerprint. When a match is found a brute force algorithm is 
then used to confirm whether it is an actual match, since both fingerprints and 
Bloom filters allow erroneous matches. 
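
A Bloom filter organized as several bit-tables, with one bit per table for each stored fingerprint, can be sketched in Python as below. The hash functions (multiplication by fixed odd constants), the table size and the class name are assumptions made for illustration; the actual hash functions and table sizes used in GAP are not described here.

```python
MULTIPLIERS = (0x9E3779B1, 0x85EBCA6B, 0xC2B2AE35, 0x27D4EB2F)

class BloomFilter:
    """num_tables bit-tables of `size` bits; each item sets one bit per table."""

    def __init__(self, size, num_tables=3):
        if not 1 <= num_tables <= len(MULTIPLIERS):
            raise ValueError("unsupported number of tables")
        self.size = size
        self.tables = [bytearray((size + 7) // 8) for _ in range(num_tables)]

    def _addresses(self, fingerprint):
        # One table address per hash function (illustrative hash family).
        return [(fingerprint * MULTIPLIERS[j] + j) % self.size
                for j in range(len(self.tables))]

    def add(self, fingerprint):
        for table, addr in zip(self.tables, self._addresses(fingerprint)):
            table[addr >> 3] |= 1 << (addr & 7)

    def __contains__(self, fingerprint):
        # True for every stored item; may also be True for items never stored.
        return all(table[addr >> 3] & (1 << (addr & 7))
                   for table, addr in zip(self.tables, self._addresses(fingerprint)))

bloom = BloomFilter(size=1024, num_tables=3)
for fp in (259, 517, 772):
    bloom.add(fp)
print(517 in bloom)  # True
```

A positive answer only nominates a candidate; as in the text, an exact comparison is still needed because both the fingerprint and the filter can report matches that are not real.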

As long as the Bloom filter is reasonably loaded, these Tietze procedures work 
well. They are fast and space efficient. However presentation $\mathcal{R}$ causes problems. 
Almost all matches are false: 1,694,640 out of 1,716,314. Thus almost 99% of 
the matches are false, and there are only 21,674 actual matches. This leads to 
inordinate execution time, used in the brute force searches, and a total cpu time 
of about 10 hours on a fast Sparc machine. 

Where do these false matches occur? Are they false fingerprint matches? Or 
is it in the Bloom filters? 

We replaced the Bloom filters by an ordinary hash table (which is slower and 
uses more space). This revealed that only 283 out of 21,957 fingerprint matches 
are false, 1.2%. The total execution time is reduced to less than an hour, about 
9% of that using Bloom filters (but the ordinary hash table uses 8 times more 
space). This indicates that almost all false matches occur in the Bloom filters. A 
further study reveals that the false matches mainly occur at a late stage of the 
processing, when the size of the alphabet is 4 and the length of pattern strings 
is over 500. (With "small" examples, such as J and T the time taken by the 
ordinary hash table is about twice that taken by 3-bit Bloom filters.) Using 4 
bits instead of 3 bits to represent a fingerprint in Bloom filters (and using a 
similar hash function to produce the addresses of the fourth bit) reduces the 
number of false matches for $\mathcal{R}$ to 432,383 and the execution time by about a 
factor of three compared to the 3-bit filter. Again the false matches occur in the 
late stage when the size of the alphabet is 4, but even later, becoming frequent 
when the length of pattern strings is over 10,000 symbols. 
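
A standard estimate, which is not derived from the experiments reported here, helps to explain this behaviour. Suppose each of the $k$ bit-tables has $m$ bits (the table size used in GAP is not stated in this paper, so $m$ is left symbolic), the hash functions behave independently and uniformly, and $n$ distinct fingerprints have been stored; for a single pattern relator $n$ is at most $2l_p$. Under these assumptions, the probability that a fingerprint which was never stored nevertheless passes the filter is

\[
P_{\mathrm{false}} \;=\; \left(1 - \left(1 - \frac{1}{m}\right)^{n}\right)^{k} \;\approx\; \left(1 - e^{-n/m}\right)^{k}.
\]

For fixed $m$ this probability tends to 1 as the pattern length, and hence $n$, grows, which agrees with false matches appearing only once the pattern strings become long. For a fixed load $n/m$, going from $k = 3$ to $k = 4$ multiplies the estimate by the factor $1 - e^{-n/m} < 1$, which agrees in direction with the reduction observed for the 4-bit filter.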

Thus, fingerprints combined with Bloom filters provide an effective way of 
substring searching in this application. Except in the final stages of hard 
computations, when the filter may become overloaded, they are economical in both 
space and time. Alternative methods should be used in such final stages. 

We have investigated the use of automaton-based string searching for this 
application. Automata have been successfully used to search for single pattern 
and multiple patterns [2, 3]. Perleberg [16] presented a longest substring (LS) 
searching algorithm based on automata. This algorithm requires another table 
next length in addition to next state. The next length table gives the maximum 
length of a substring that ends in the next state with the restriction that the 
next state follows the current state. Directly extending a single pattern match 
automaton to the LS problem would require $O(m^2|\Sigma|)$ space, $O(m^2|\Sigma| + m)$ 
preprocessing time, and $O(n)$ running time (where the pattern has length $m$ 
and the text length $n$). Perleberg's algorithm, by maintaining the next length 
table, only requires $O(m|\Sigma|)$ space, $O(m|\Sigma| + m^2)$ preprocessing time, and $O(n)$ 
running time. 

Using relator extension as described in §3, we implemented Perleberg's 
algorithm at the match level. For each pattern relator $R_p$, we build two automata, 
one for $R'_p$ and one for $(R'_p)^{-1}$. Even with a change to the heuristic strategy of 
Tietze processes to reduce the amount of substring searching, we found the 
automaton-based method to be slow. Thus, automaton searching takes 226 
seconds for J, compared with 37 seconds for a brute-force variant with change 
flags; for T, it is 1,157 seconds as against 105. It takes too long to build the 
two automata for each pattern relator. For J, there are 20,654 pattern relators 
in the whole run, for which automata construction takes 163 seconds, which is 
72% of the total time. For T, there are 26,431 pattern relators, and automata 
construction takes 75% of the total time. 

We can reduce the number of automata needed to one per search by using 
the equation ComStr($R_p$, $R_t$) = {comsubstr($R'_p$, $R'_t$); comsubstr($R'_p$, $(R'_t)^{-1}$);}. 
In this alternative, we replace a situation with two pattern strings and one text 
string by one with one pattern string and two text strings. Since we need one 
automaton per pattern, the time taken building automata is reduced by a factor 
of about two. Even though the preprocessing time is reduced, it is still far too 
much for our applications. The preprocessing time alone is still much more than 
the total time spent in the alternative method. 



6 Conclusions 

We have studied the substring searching component of presentation manipula- 
tion algorithms used in computational group theory. It differs from other string 
searching problems. We gave a formal definition of the problem and developed a 
two level searching model. We presented two timestamp algorithms at the first 
level and proved minimal-cover theorems associated with them. At the second 
level, we investigated methods based on signatures, fingerprints, Bloom filters, 
and automata. Detailed experiments revealed that different methods have ad- 
vantages in different stages of the processes. 